Zurich by the Numbers - Predictive Insights into Tourism Dynamic

Authors
Affiliation

Name I, First Name I

University of Lausanne

Name II, First Name II

Published

May 8, 2024

Abstract

The following Forecasting project focuses on applying forecasting techniques to predict tourism trends in Zurich. This analysis aims to harness the power of historical data combined with forecasting algorithms to provide actionable insights into future tourism patterns. We engage in comprehensive data preparation, explore various predictive models, and conduct a detailed evaluation of their forecasting accuracy. The project encapsulates the challenge of turning complex data into understandable and strategic information, crucial for effective decision-making in Zurich’s tourism sector.

1 Exploration & Visualization

1.1 Objectives

The main objectives of this project is to predict :

  • The overnight stays of the visitors in Vaud, from October 2023 until December 2024.

  • The overnight satys of visitors from Philippines to Zürich, from October 2023 until December 2024.

2 DATA

2.1 Cleaning & Wrangling

2.1.1 Tourism Data - All

The dataset contains information about the overnights stays by tourists in the various Swiss cantons. It indicates the tourist’s country of origin, the canton of stay, the month, the year and the total number of overnights stays.

As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.

Click to show code
# Load the data in folder data named Dataset_tourism.xlsx)
tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))

#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#print unique values in month column
unique(tourism_data$Monat)
#>  [1] "Januar"    "Februar"   "März"      "April"     "Mai"      
#>  [6] "Juni"      "Juli"      "August"    "September" "Oktober"  
#> [11] "November"  "Dezember"
# change ' [1] "Januar"    "Februar"   "März"      "April"     "Mai"       "Juni"      "Juli"      "August" "September" "Oktober"   "November"  "Dezember" into english month'
tourism_data$Monat <- tourism_data$Monat %>% recode_factor(
  "Januar" = "January",
  "Februar" = "February",
  "März" = "March",
  "April" = "April",
  "Mai" = "May",
  "Juni" = "June",
  "Juli" = "July",
  "August" = "August",
  "September" = "September",
  "Oktober" = "October",
  "November" = "November",
  "Dezember" = "December"
)
#add date type column for plotting purposes
tourism_data <- tourism_data %>% mutate(Date = dmy(paste("01", Monat, Jahr)))
# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total
tourism_data_no_total <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")
#check for NAN
sum(is.na(tourism_data_no_total))
#> [1] 51395
#analyse the NAN values, where are they
(tourism_data_no_total %>% filter(is.na(value)))
#> # A tibble: 51,395 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Schwe~ Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Schwe~ Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Schwe~ Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Schwe~ Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Schwe~ Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Schwe~ Janu~ 2005     NA 2005-01-01
#> # i 51,385 more rows
#show data using reactable only showing the first 100 rows
reactable::reactable(head(tourism_data_no_total, 1000))

2.1.2 Tourism Data - Vaud

Click to show code
# Filter by canton Vaud 
tourism_vaud <- tourism_data %>% filter(Kanton == "Vaud")
#check for NAN
sum(is.na(tourism_vaud))
#> [1] 1869
#show the data in a table using reactable
reactable::reactable(head(tourism_vaud, 20))

2.1.3 Tourism Data - Zurich

We filtered the ‘Canton’ column to keep only the canton of Zürich.

Click to show code
#filter column 'Kanton' for Zurich
tourism_data_zurich <- tourism_data_no_total %>% filter(Kanton == "Zürich")
#check for NAN
sum(is.na(tourism_data_zurich))
#> [1] 1869
#analyse the NAN values, where are they
tourism_data_zurich %>% filter(is.na(value))
#> # A tibble: 1,869 x 6
#>    Herkunftsland                  Kanton Monat Jahr  value Date      
#>    <chr>                          <chr>  <fct> <chr> <dbl> <date>    
#>  1 Malta                          Zürich Janu~ 2005     NA 2005-01-01
#>  2 Zypern                         Zürich Janu~ 2005     NA 2005-01-01
#>  3 Mexiko                         Zürich Janu~ 2005     NA 2005-01-01
#>  4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005     NA 2005-01-01
#>  5 Bahrain                        Zürich Janu~ 2005     NA 2005-01-01
#>  6 Katar                          Zürich Janu~ 2005     NA 2005-01-01
#>  7 Kuwait                         Zürich Janu~ 2005     NA 2005-01-01
#>  8 Australien                     Zürich Janu~ 2005     NA 2005-01-01
#>  9 Neuseeland, Ozeanien           Zürich Janu~ 2005     NA 2005-01-01
#> 10 Oman                           Zürich Janu~ 2005     NA 2005-01-01
#> # i 1,859 more rows

#show the data in a table using reactable
reactable::reactable(head(tourism_data_zurich, 1000))

2.1.4 Tourism Data - Zurich and Philipines

Same as above, but we’re also filtering the country of origin, keeping only ‘Philippinen’ as that is what we’re aiming for.

Click to show code
tourism_data_zurich_philippines <- tourism_data_zurich %>% filter(Herkunftsland == "Philippinen")
#show table using reactable
reactable::reactable(tourism_data_zurich_philippines)

2.1.5 Deal with NAN

We have none in the data filtered with zurich and philippines, but if we would have we would :

2.1.5.1 Impute missing values ARIMA

If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data.

Click to show code
# #Creating a tsibble with missing values
# data <- tourism_data_zurich_philippines %>%
#   as_tsibble(key = c(Kanton, Herkunftsland, Monat, Jahr)) %>%
#   select(Date, value) %>%
#   fill_gaps()
# 
# # Fit an ARIMA model to data with missing values
# model_fit <- data %>%
#   model(ARIMA(value))
# 
# # Interpolate missing values using the fitted ARIMA model
# filled_data <- model_fit %>%
#   interpolate(data)
# 
# # Print the data with filled in missing values
# print(filled_data)

3 EDA - Vaud

3.1 Visitors from different countries in Vaud

Click to show code
plot_vaud <- tourism_vaud %>% 
  filter(Herkunftsland != 'Herkunftsland - Total') %>%
  ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,
                      text = paste("Country:", Herkunftsland, "Trips:", value))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) + 
  labs(title = "Number of visitors from Each Country to Vaud",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object
interactive_plot <- ggplotly(plot_vaud, tooltip = "text")

# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(margin = list(l = 60, r = 60, b = 60, t = 80),  # Adjust margins
         legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))  # Adjust legend position

# Display the interactive plot
interactive_plot

According to the graph the most tourist in canton Vaud are Swiss and France.

Graphical view of total number of tourists in canton Vaud

Click to show code
head(tourism_vaud)
#> # A tibble: 6 x 6
#>   Herkunftsland         Kanton Monat   Jahr  value Date      
#>   <chr>                 <chr>  <fct>   <chr> <dbl> <date>    
#> 1 Herkunftsland - Total Vaud   January 2005  57942 2005-01-01
#> 2 Schweiz               Vaud   January 2005  27853 2005-01-01
#> 3 Baltische Staaten     Vaud   January 2005    103 2005-01-01
#> 4 Deutschland           Vaud   January 2005   3052 2005-01-01
#> 5 Frankreich            Vaud   January 2005   7667 2005-01-01
#> 6 Italien               Vaud   January 2005   2154 2005-01-01
tourism_vaud_total <- tourism_vaud %>%
  filter(Herkunftsland == 'Herkunftsland - Total') %>%
  select(-c(Herkunftsland, Kanton, Monat, Jahr))
tourism_vaud_total%>%
  ggplot(aes(x = Date, y = value)) +
  geom_line() +
  labs(x = "Date", y = "Number of tourists", title = "Total number of tourists in canton Vaud") + 
  theme_minimal()

3.1.1 Decomposition

Click to show code
# Convert data to a time series object
vaud_ts <- tourism_vaud_total %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date))))

# Decompose the time series
vaud_ts %>% decompose() %>% plot()

We see clear seasonality with picks in summer and dip in January.

We saw a big drop in March 2020 due to the global coronavirus pandemic.

4 EDA - Zurich

4.1 Zurich and All visiting countries

Click to show code
# Preparing the data
#removing value 'Schweiz' in column 'Herkunftsland' as it is just the whole of Switzerland
data <- tourism_data_zurich %>%
  filter(!is.na(value)) %>%  # Removing rows with NA values in the 'value' column
  mutate(Monat = month(Date, label = TRUE, abbr = TRUE),  # Extract month from Date
         Jahr = year(Date)) %>%  # Extract year from Date
  group_by(Herkunftsland, Date) %>%  # Group by country and date
  summarise(Trips = sum(value), .groups = 'drop')  # Summing up trips for each country per date

p <- ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,
                      color = Herkunftsland == "Philippinen",
                      text = paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +  # Added text for tooltip
  geom_line(show.legend = FALSE) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "grey")) +
  labs(title = "Number of Trips from Each Country to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()

# Convert to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = "text")

# Adjust plotly settings 
interactive_plot <- interactive_plot %>%
  layout(margin = list(l = 60, r = 60, b = 60, t = 80),  # Adjust margins
         legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2),
         width = 600,
         height = 400)  # Adjust legend position
#> Warning: Specifying width/height in layout() is now deprecated.
#> Please specify in ggplotly() or plot_ly()
# Display the interactive plot
interactive_plot

4.2 Zurich and Philipines Visitors

This graph shows changes in the number of overnight stays between January 2005 and September 2023.

Germans are the most frequent visitors to Zurich. From 2022 onwards, the United States will be the first country to visit Zurich. The red curve representing the Philippines is flat and very close to 0 (little volatility), indicating a considerably lower number of trips from this country to Zurich over the entire period observed.

A drastic drop can be observed. This could be due to the COVID-19 pandemic, which had a significant impact on international travel. At first glance, we can observe regular seasonal peaks for most countries, suggesting the presence of seasonality in tourism to the canton of Zurich.

4.3 Zurich and Philippinens Visitors

Click to show code
# use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axis
p <- ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +
  geom_line() +
  labs(title = "Number of Trips from Philipines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()
p

  • Slight seasonality

  • General upward trend in the number of nights spent before 2020.

  • Very sharp fall in 2020 - covid

  • General upward trend after the fall

4.3.1 Pattern

4.3.1.1 Decompose

Click to show code
# Convert data to a time series object
tourism_ts <- tourism_data_zurich_philippines %>%
  arrange(Date) %>%
  # Ensure data is complete and monthly
  complete(Date = seq.Date(min(Date), max(Date), by = "month")) %>%
  replace_na(list(value = 0)) %>%  # Replace NA values if there are any
  # Create a time series object
  with(ts(value, frequency = 12, start = decimal_date(min(Date))))

# Decompose the time series
decomposed <- decompose(tourism_ts)

# Plot the decomposed components
plot(decomposed)

4.3.1.2 Seasonality

Click to show code
# Plot the seasonality in one chart 
ggseasonplot(tourism_ts, year.labels = TRUE, year.labels.left = TRUE)

Click to show code
# several chart per month to see the seasonality
ggsubseriesplot(tourism_ts) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

#debug
#better to use gg_subseries to see the seasonality
#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")

4.3.1.3 Trend

Click to show code
# Show the trend using moving average
# tourism_data_zurich_philippines %>%
#   autoplot(value) +
#   autolayer(ma(value, 12), series = "12-month moving average") +
#   labs(title = "Trend in the number of tourists from Philipines to Zurich",
#        x = "Date",
#        y = "Number of tourists") +
#   theme_minimal()

5 MODELLING

This part is about building on your knowledge of time series techniques to model your data. You can investigate various models but you should justify in your report your choices regarding these. Pay attention to the conditions that are needed to apply a specific model. Treat also carefully seasonality, outliers, colinearity, covariates, special events, etc. Remember the following steps:

  1. Aggregation choice for hierarchical time series
  2. Model building
  3. Model selection

5.1 Total number of visitors in Vaud

5.1.1 Outliers, Correlation, Colinearity, Covariates, Special Events ?

Questions ?

5.1.2 ETS model

Click to show code
ets_vaud <- ets(vaud_ts, model = "AAA")
forecast_ets_vaud <- forecast(ets_vaud, h = 24) %>% plot(main = "Forecast of visitors in Vaud", xlab = "Date", ylab = "Number of visitors")

5.1.3 ARIMA

Click to show code
arima_vaud <- auto.arima(vaud_ts, seasonal = TRUE)

# Generate forecasts for the next 2 years (24 months)
forecast_arima_vaud <- forecast(arima_vaud, h = 24)

# Plot the forecast
plot(forecast_arima_vaud, main = "ARIMA Forecast for Vaud Tourism", xlab = "Date", ylab = "Number of Tourists")

Question: Forecast for each country or general forecast?

5.2 Zurich and Philipines

5.2.1 Outliers, Correlation, Colinearity, Covariates, Special Events ?

5.2.2 Special Event

5.2.2.1 Covid impact

5.2.2.1.1 How philipines dealt with it

In 2021, the Philippines faced the challenges of the COVID-19 pandemic with a mix of resilience and adaptation.

  • Lockdowns and Waves: The country experienced two waves of COVID-19, leading to prolonged lockdowns throughout the year. These restrictions aimed to curb the spread of the virus.
  • Vaccination Campaign: Despite challenges, millions of Filipinos received COVID-19 vaccines. However, experts raised concerns about the campaign’s sluggish pace.
  • Senate Investigations: Lawmakers probed alleged anomalies in pandemic contracts, leading to marathon Senate hearings.
  • Easing Restrictions: Towards the end of the year, daily cases dropped, and mandatory face shield policies were lifted. This signaled progress in overcoming the crisis.
  • Risk-Based Decisions: While the holidays looked promising, the threat of new variants remained. Experts advised practicing “risk-based” decisions for activities despite low case numbers1.
  • Filipinos have become more mindful of hygiene practices, including social distancing, mask-wearing, and proper handwashing. The pandemic also prompted a shift in consumption patterns, with increased focus on essentials and at-home entertainment. However, air travel remains limited due to ongoing concerns.

As for travel, the Philippines continues to navigate the balance between safety and economic recovery. While some travel restrictions have eased, travelers must stay informed about evolving guidelines and exercise caution when planning trips. The threat of new variants underscores the need for vigilance and informed decision-making1

Question : How to deal with this blackswan event ? - Incorporating Dummy Variables Introduce dummy variables into your forecasting models to account for the impact of COVID-19. These variables can be set to 1 for the periods affected by the pandemic and 0 otherwise. This approach allows the model to differentiate the impact of COVID-19 from normal variations in the data. - Replacing covid values with the ARIMA, If the missing values are random or if excluding them would result in a loss of valuable information, we might consider imputing them. One common approach is to use statistical models like ARIMA to interpolate missing values based on the patterns observed in the available data.

Click to show code
# analyse the impact of covid on the data from december 2019 to 2023
covid_impact <- tourism_data_zurich_philippines %>%
  filter(Date >= "2019-11-01" & Date <= "2023-01-01") %>%
  ggplot(aes(x = Date, y = value)) +
  geom_line() +
  labs(title = "Impact of Covid on Tourism from Philipines to Zurich",
       x = "Date",
       y = "Number of Trips") +
  theme_minimal()
covid_impact

5.2.3 Naive Forecast

Click to show code
#convert tourism_ts to tsibble
tourism_ts <- tourism_ts %>% as_tsibble()
# Fit a naive model
fit <- tourism_ts %>%
  model(NAIVE = NAIVE(value))
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 24)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

5.2.4 Exponential Smoothing

5.2.4.1 Model ETS 1

  • Why additive errors ? Because the variance of the errors is constant over time no ?
Click to show code
# Fit an ETS model
# Adjusting the model parameters according to the characteristics of the data
# Here "A" means additive error, "N" means no trend, and "N" means no seasonality
# change these if needed
fit <- tourism_ts %>%
  model(ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("A"))) #multiplicative seasonality)
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 24)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

Clearly see here that the confidence interval is too big, almost like a naive forecast

Why trend dp and seaso M ? - Trend is present so A - Seasonality is present and growing over time so Multiplicative was chosen

Click to show code
# comparing several model
fit <- tourism_ts %>%
  model(
        ETS_M_seaso = ETS(value ~ error("A") + trend("A") + season("M")), #multiplicative seasonality
        ETS_M_seaso_Ad = ETS(value ~ error("A") + trend("Ad") + season("M")), #dampted trend
  )

# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 24)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, level = 90, color = "blue", alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

5.2.4.1.1 Thus Chosen Model :
Click to show code
fit <- tourism_ts %>%
  model(ETS_M_seaso = ETS(value ~ error("A") + trend("Ad") + season("M"))) #multiplicative seasonality)
# Forecast the next 2 years periods
forecast <- fit %>%
  forecast(h = 24)
# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposes
plot <- forecast %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "Forecast of tourists from Philipines to Zurich",
       x = "Date",
       y = "Number of tourists") + guides(colour = guide_legend(title = "Forecast"))
plot

5.2.5 ARIMA

Question : Do we need to differentiate the data ?

Click to show code
# Fit an automatic ARIMA model
fit_arima <- tourism_ts %>%
  model(ARIMA_auto = ARIMA(value))

# Forecast the next 2 years (24 months)
forecast_arima <- fit_arima %>%
  forecast(h = 24)

# Plot the forecasts along with the historical data
plot_arima <- forecast_arima %>%
  autoplot(tourism_ts, alpha = 0.5) +
  labs(title = "ARIMA Forecast of Tourists from the Philippines to Zurich",
       x = "Date",
       y = "Number of Tourists") +
  guides(colour = guide_legend(title = "Forecast"))
plot_arima

Ugly forecast, confidence interval is too big

5.2.5.1 Diagnostic

Click to show code
# Check the diagnostics of the ARIMA model
fit_arima %>% report()
#> Series: value 
#> Model: ARIMA(0,1,4)(0,0,2)[12] 
#> 
#> Coefficients:
#>           ma1     ma2      ma3     ma4   sma1    sma2
#>       -0.3889  0.0978  -0.1636  -0.155  0.418  0.1582
#> s.e.   0.0693  0.0809   0.0638   0.095  0.086  0.0688
#> 
#> sigma^2 estimated as 13178:  log likelihood=-1379
#> AIC=2771   AICc=2772   BIC=2795

autoplot(residuals(fit_arima))

Click to show code
# using auto.arima with stepwise and approximation options turned off for a more thorough search
fit_updated <- auto.arima(tourism_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
summary(fit_updated)
#> Series: tourism_ts 
#> ARIMA(4,1,0)(1,0,0)[12] 
#> 
#> Coefficients:
#>          ar1     ar2     ar3     ar4   sar1
#>       -0.407  -0.161  -0.236  -0.270  0.462
#> s.e.   0.064   0.071   0.073   0.066  0.074
#> 
#> sigma^2 = 12436:  log likelihood = -1373
#> AIC=2758   AICc=2758   BIC=2778
#> 
#> Training set error measures:
#>                ME RMSE  MAE   MPE MAPE  MASE   ACF1
#> Training set 5.55  110 66.9 -77.7  147 0.589 -0.041

6 Model Selection

Use Information Criteria for Model Comparison: Evaluate models based on criteria such as AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion), which help in selecting a model that balances goodness of fit with complexity.